Abstract


This  report  presents  an  overview  of  an  explicit  message-passing  paradigm 
for an Eulerian finite volume method for modeling solid dynamics problems
involving  shock  wave  propagation,  multiple  materials,  and  large 
deformations.  Three-dimensional  simulations  of  high  velocity  impact  were 
conducted on the Sun HPC 10000 computer system. The scalability of the
message-passing  code  on  this  symmetrical  multiple  processor  architecture  is 
presented  and  is  compared  to  the  ideal  linear  multiple  processor 
performance.  The  computed  results  are  also  compared  to  experimental  data 
for  the  purpose  of  validating  the  shock  physics  application  on  the  Sun  HPC 
10000  system. 
1. INTRODUCTION


The  mechanics  of  penetration  and  perforation  of  solids  have  long  been  of  interest  for 
military  applications  in  terminal  ballistics.  Kinetic  energy  penetration  phenomena  are  also 
germane to applications involving high-mass and high-velocity debris attributable to
accidents or high-rate energy release, the transportation safety of hazardous materials, the safety
of  nuclear  reactor  containment  vessels,  the  design  of  lightweight  body  armors,  the  erosion  and 
fracture  of  solids  because  of  repeated  impacts  by  liquid  or  solid  particles,  and  the  protection 
of spacecraft from meteoroid impact. Thorough reviews of the fundamentals of penetration
and perforation and their application to practical problems have been prepared by Goldsmith
(1960),  Johnson  (1972),  Backman  and  Goldsmith  (1978),  and  Zukas  et  al.  (1982,  1990). 

Analytical approaches to penetration mechanics tend to fall into three categories:
empirical or quasi-analytical, approximate analytical, and numerical methods. While empirical
and  approximate  analytical  methods  are  quite  useful  for  developing  an  appreciation  of  the 
dominant  physical  phenomena,  they  are  limited  in  scope.  Numerical  methods  provide  a 
complete  description  of  the  dynamics  of  impacting  solids,  which  account  for  the  geometry 
of  the  interacting  bodies;  elastic,  plastic  and  shock  wave  propagation;  hydrodynamic  flow; 
finite  strains  and  deformations;  high  strain  rate  material  behavior;  and  the  initiation  and 
propagation of failure in the colliding bodies. Computer codes for modeling wave
propagation and impact have matured considerably since their initial development about 45 years
ago.  Today  they  serve  as  valuable  tools  in  studies  of  materials  and  structures  subjected  to 
intense  impulsive  loading.  Benson  (1992)  recently  documented  a  comprehensive  review  of 
the  physics  and  numerics  in  wave  propagation  codes. 

Three-dimensional  (3-D)  simulations  of  high-velocity  impact  phenomena  continue  to 
delineate the high performance computing resources for Army applications in terminal
ballistics. Current applications in high-velocity impact phenomena require the simulation time
to  increase  from  the  microsecond  to  millisecond  regime,  and  complex  geometries  dictate  a 
finer mesh resolution which mandates a smaller time integration increment to satisfy
stability criteria and additional time integration cycles. For a given Eulerian computational
domain,  memory  requirements  scale  inversely  with  the  cube  of  the  zone  size  and  processor 
requirements  scale  to  the  fourth  power  as  the  mesh  is  refined  with  smaller  zones.  These 
factors,  when  coupled  with  the  requirement  to  model  larger  physical  domains,  are  strong 
stimuli  for  exploiting  scalable  architectures  and  algorithms. 
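
The scaling argument above can be illustrated with a short sketch. It assumes, as the paragraph implies, that the zone count grows as the inverse cube of the zone size and that the stable time step shrinks in proportion to the zone size, so the number of cycles grows as the inverse of the zone size. The function name `refinement_cost` is hypothetical and is not part of CTH.

```python
def refinement_cost(zone_ratio):
    """Return (memory_factor, work_factor) for a fixed 3-D Eulerian domain.

    zone_ratio is new_zone_size / old_zone_size (0.5 means zones half as long).
    Assumptions: memory ~ number of zones ~ zone_size**-3, and the stable
    time step ~ zone_size, so total work ~ zones * cycles ~ zone_size**-4.
    """
    if zone_ratio <= 0.0:
        raise ValueError("zone_ratio must be positive")
    memory_factor = zone_ratio ** -3   # more zones to store
    work_factor = zone_ratio ** -4     # more zones times more cycles
    return memory_factor, work_factor


if __name__ == "__main__":
    memory, work = refinement_cost(0.5)
    print(f"halving the zone size: memory x{memory:g}, processor work x{work:g}")
```

Under these assumptions, halving the zone size multiplies the memory requirement by 8 and the processor work by 16.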

Under  the  aegis  of  the  Department  of  Defense  (DoD)  high  performance  computing  (HPC) 
modernization  program  (Jones  1996),  DoD  researchers  are  afforded  access  to  scalable  HPC 
resources. The successful use of scalable architectures for large-scale simulations of
high-velocity impact requires reliable and robust scalable applications algorithms. The common
HPC software support initiative (CHSSI) component of the DoD HPC modernization
program addresses the development, validation, and demonstration of scalable software in a
number  of  defense  computational  technology  areas.  This  report  presents  an  overview  of  an 
explicit  message-passing  paradigm  for  applications  in  shock  physics.  Scalable  performance 
results for a 3-D oblique rod impact are presented for the Sun HPC 10000 system, a new symmetric
multiple  processor  (SMP)  architecture. 
2. SCALABLE PARADIGM FOR IMPACT PROBLEMS


CTH  (McGlaun  &  Thompson  1990)  is  an  Eulerian  finite  volume  code  for  modeling 
solid  dynamics  problems  involving  shock  wave  propagation,  multiple  materials,  and  large 
deformations  in  one,  two,  and  three  dimensions.  CTH  is  widely  used  across  the  defense 
research  and  development  community  to  model  problems  in  shock  wave  propagation.  CTH 
employs  a  two-step  solution  scheme  -  a  Lagrangian  step  followed  by  a  remap  step.  The 
conservation  equations  are  replaced  by  explicit  finite  volume  equations  that  are  solved  in 
the Lagrangian step. The remap step uses operator splitting techniques to replace
multidimensional equations with a set of one-dimensional equations. The remap or advection
step  is  based  on  a  second  order  accurate  method  by  van  Leer  (1977).  To  minimize  material 
dispersion,  several  high  resolution  material  interface  trackers  are  available.  Both  analytical 
and  tabular  equations  of  state  are  available  to  model  the  hydrodynamic  behavior  of  materials. 
Models  for  elastic-plastic  behavior  and  high  explosive  detonation  are  also  available. 
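
As a rough illustration of what a second-order, van Leer-type advection step looks like in one dimension, the sketch below advects a scalar on a periodic mesh with a limited linear reconstruction in each cell. It is not CTH's remap implementation: the report does not state which limiter, interface tracker, or data layout CTH uses, so the harmonic-mean limiter, the periodic boundary, the constant positive velocity, and the names `van_leer_slope` and `advect_step` are all assumptions made for this example.

```python
def van_leer_slope(left, center, right):
    """Limited cell slope using the harmonic-mean (van Leer-type) limiter."""
    d_minus = center - left
    d_plus = right - center
    if d_minus * d_plus <= 0.0:
        return 0.0  # local extremum: fall back to first order
    return 2.0 * d_minus * d_plus / (d_minus + d_plus)


def advect_step(q, courant):
    """Advance q one step for constant positive velocity on a periodic mesh.

    courant is velocity * dt / dx and must lie in [0, 1].
    """
    n = len(q)
    slopes = [van_leer_slope(q[i - 1], q[i], q[(i + 1) % n]) for i in range(n)]
    # Normalized flux leaving each cell through its right-hand face.
    flux = [courant * (q[i] + 0.5 * (1.0 - courant) * slopes[i]) for i in range(n)]
    # Conservative update: what leaves cell i-1 enters cell i.
    return [q[i] - (flux[i] - flux[i - 1]) for i in range(n)]


if __name__ == "__main__":
    pulse = [1.0 if 10 <= i < 20 else 0.0 for i in range(50)]
    state = pulse
    for _ in range(40):
        state = advect_step(state, 0.5)
    print("total before:", sum(pulse), "total after:", round(sum(state), 12))
```

The update is conservative by construction, and the limiter prevents the reconstruction from creating new extrema, which is the property that makes schemes of this kind attractive for advecting material quantities.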

Robinson et al. (1992) developed the algorithmic framework for conducting scalable
Eulerian finite volume simulations for modeling problems in solid dynamics, based on
object-oriented programming. Robinson demonstrated that the structured mesh of the Eulerian
finite volume method is well suited for scalable paradigms employing message passing
between computational sub-domains.

Scalable  computer  architectures  are  characterized  by  a  large  number  of  computational 
nodes consisting of memory, one or more commodity processors, and an internal
communications network. One computing technique that can be employed on this type of architecture is
referred  to  as  single  program  multiple  data  (SPMD).  Under  the  SPMD  paradigm,  the  same 
executable  code  runs  on  each  computational  node,  but  each  executable  works  on  a  different 
set  of  data.  Algorithms  that  depend  on  a  fixed,  logically  connected  mesh  are  readily  adapted 
to  the  SPMD  paradigm.  The  technique  used  for  SPMD  parallelism  in  CTH  is  similar  to  the 
formulation  developed  by  Robinson  et  al.  (1992)  in  that  the  entire  problem  domain  is  divided 
into  sub-domains  that  reside  on  individual  computational  nodes. 
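
The division of a structured mesh into sub-domains can be sketched in one coordinate direction as follows. This is a minimal illustration only; the report does not describe how CTH chooses its sub-domain boundaries, and the function name `split_cells` is hypothetical.

```python
def split_cells(n_cells, n_subdomains):
    """Split n_cells into contiguous [start, stop) ranges, one per sub-domain.

    Range sizes differ by at most one cell so the workload stays balanced.
    """
    if n_subdomains < 1 or n_cells < n_subdomains:
        raise ValueError("need at least one cell per sub-domain")
    base, extra = divmod(n_cells, n_subdomains)
    ranges = []
    start = 0
    for rank in range(n_subdomains):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # Example: 215 cells in x divided among four sub-domains.
    for rank, (lo, hi) in enumerate(split_cells(215, 4)):
        print(f"rank {rank}: cells {lo}..{hi - 1}")
```

Under SPMD, every process would run this same routine and then work only on the range that corresponds to its own rank.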

The  use  of  “ghost”  cells  is  a  common  technique  for  applying  boundary  conditions  to 
finite  difference  and  finite  volume  schemes,  making  the  internal  differencing  computations 
independent  of  edges  and  corners  in  the  Eulerian  mesh.  To  adapt  CTH  to  the  SPMD 
paradigm,  these  ghost  cells  are  used  for  passing  messages  between  nodes.  This  practice  of 
explicit  message  passing  between  sub-domains  allows  each  of  the  individual  sub-domains 
to  have  access  to  its  neighboring  sub-domains’  boundary  cell  data.  Where  a  sub-domain 
boundary  is  an  external  boundary  of  the  overall  computational  domain,  the  ghost  cell  data 
are  based  on  the  appropriate  boundary  condition  approximation.  A  simple  example  of  this 
approach  to  mesh  decomposition  with  explicit  message  passing  is  provided  in  Figure  1.  A 
thorough  description  of  the  distributed  finite  volume  algorithm  and  message  communication 
between  sub-domains  is  provided  by  Kimsey  et  al.  (1998). 
Figure 1. CTH Mesh Decomposition With Explicit Message Passing
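
The ghost-cell exchange described above can be sketched for a one-dimensional row of sub-domains. The example runs in a single process and copies values between Python lists; in an actual SPMD code each copy would be a message sent between processes. The one-cell ghost layer, the constant-value external boundary condition, and the name `exchange_ghosts` are assumptions made for illustration and are not taken from CTH.

```python
def exchange_ghosts(subdomains, boundary_value=0.0):
    """Fill the ghost cells of each sub-domain in place.

    Each sub-domain is a list laid out as [ghost_lo, interior..., ghost_hi].
    Interior faces receive the neighbor's adjacent interior cell; faces on the
    outer edge of the overall domain receive the boundary-condition value.
    """
    last = len(subdomains) - 1
    for rank, cells in enumerate(subdomains):
        # Low-side ghost: last interior cell of the lower neighbor.
        cells[0] = subdomains[rank - 1][-2] if rank > 0 else boundary_value
        # High-side ghost: first interior cell of the upper neighbor.
        cells[-1] = subdomains[rank + 1][1] if rank < last else boundary_value


if __name__ == "__main__":
    subs = [[None, 1.0, 2.0, None], [None, 3.0, 4.0, None]]
    exchange_ghosts(subs, boundary_value=-1.0)
    print(subs)
```

After the exchange, each sub-domain can apply its differencing stencil to every interior cell without testing whether the cell lies on a sub-domain edge.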

3.  SUN  HPC  10000  SYSTEM 

The  scalability  and  performance  trials  described  in  this  report  focus  on  a  relatively  new 
entry into the HPC arena. The Sun HPC 10000 architecture is a uniform memory access
(UMA)  symmetric  multi-processor.  It  can  hold  as  many  as  64  UltraSPARC  processors  and 
64 gigabytes (GB) of main memory. The system uses a crossbar interconnect for
interprocessor communication and has an overall system bandwidth of 12.8 GB/s with memory
latencies  of  400  to  600  ns  (Sun  Microsystems  1997).  The  UMA  architecture  is  designed  so 
that  any  processor  may  access  data  in  any  segment  of  memory  in  the  same  amount  of  time, 
regardless  of  the  relative  locations  of  the  processor  and  segment  of  memory  in  question. 

The  scalability  study  was  performed  with  a  set  of  HPC  10000  systems  operated  by  the 
U.S.  Army  Research  Laboratory  (ARL)  Major  Shared  Resource  Center  (MSRC).  As  of  this 
writing,  the  MSRC  has  three  systems  available  to  support  unclassified  processing  for  defense 
science  and  technology  programs.  Two  of  these  systems  each  have  64  processors  and  64  GB 
of  main  memory.  The  third  system  has  32  processors  and  32  GB  of  main  memory.  All  the 
systems  are  equipped  with  UltraSPARC  processors  operating  at  a  clock  speed  of  400  MHz. 

Each  of  the  systems  contains  several  high-speed  network  interfaces.  One  of  these,  the 
asynchronous  transfer  mode  (ATM)  interface,  has  a  peak  bandwidth  of  622  Mb/s  and  was 
used  in  trials  in  which  the  simulations  were  distributed  across  two  of  the  three  systems.  The 
message-passing  interface  (MPI)  (Gropp  et  al.  1994)  was  used  for  explicit  message  passing 
between  processes  (Snir  et  al.  1995;  Sun  Microsystems  November  1997). 


4.  SCALABLE  HIGH-VELOCITY  IMPACT  SIMULATIONS 

CTH with explicit message passing has been used to model a long rod projectile
impacting an oblique steel plate on the Sun HPC 10000 architecture. This problem was selected
because of well-characterized experimental data (Fugelso & Taylor 1978) and previous serial
CTH simulations conducted by Hertel (1992). Fugelso and Taylor conducted a series of
ballistic experiments to evaluate the effects of combined obliquity and yaw on high-density long
rod  projectiles.  Depleted  uranium  (DU)  alloy  long  rod  projectiles  with  little  or  no  yaw  were 
launched  into  an  oblique,  rolled  homogeneous  armor  (RHA)  plate  that  had  been  accelerated 
by  an  explosive  charge,  resulting  in  a  yawed  impact  in  the  plate  frame  of  reference.  The 
DU alloy (DU 0.75%Ti) projectiles were right circular cylinders with a hemispherical nose,
and  the  impact  velocities  ranged  from  0.85  to  1.65  km/s.  Yaw  and  obliquity  angles  ranged 
from 0 to 70° and 10 to 0°, respectively, in the test series. The length and diameter of the
projectile in Shot 58 of the test series are 7.67 cm and 0.767 cm, respectively, for a
length-to-diameter ratio (L/D) of 10. The striking velocity was 1.289 km/s and the thickness of
the RHA was 6.4 mm. In the laboratory frame of reference, the angle of obliquity was 73.5°,
the plate velocity was 0.217 km/s, and the projectile velocity was 1.21 km/s. In the plate
frame of reference, the angle of obliquity was 64.2°, the projectile velocity was 1.289 km/s,
and the yaw angle was -9.3°. A schematic of the initial conditions for Shot 58 is illustrated
in  Figure  2. 
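
The plate-frame values quoted for Shot 58 can be recovered from the laboratory-frame values with a simple vector subtraction. The sketch below assumes that the plate moves along its own normal toward the projectile and that the rod axis is aligned with its laboratory-frame velocity; neither assumption is stated explicitly in the text, but together they reproduce the quoted numbers. The function name `plate_frame` is hypothetical, and the sign convention for yaw is not addressed, so only the magnitude is computed.

```python
import math


def plate_frame(v_projectile, v_plate, obliquity_lab_deg):
    """Return (speed, yaw_deg, obliquity_deg) seen from the moving plate.

    The x axis lies along the lab-frame projectile velocity. The plate normal
    makes the lab-frame obliquity angle with x, and the plate is assumed to
    move along that normal toward the projectile.
    """
    theta = math.radians(obliquity_lab_deg)
    # Relative velocity = projectile velocity minus plate velocity.
    rel_x = v_projectile + v_plate * math.cos(theta)
    rel_y = v_plate * math.sin(theta)
    speed = math.hypot(rel_x, rel_y)
    # Yaw magnitude: angle between the rod axis (x) and the relative velocity.
    yaw_deg = math.degrees(math.atan2(rel_y, rel_x))
    # Obliquity: angle between the relative velocity and the plate normal.
    obliquity_deg = obliquity_lab_deg - yaw_deg
    return speed, yaw_deg, obliquity_deg


if __name__ == "__main__":
    speed, yaw, obliquity = plate_frame(1.21, 0.217, 73.5)
    print(f"speed {speed:.3f} km/s, |yaw| {yaw:.1f} deg, obliquity {obliquity:.1f} deg")
```

With the Shot 58 inputs this gives a striking velocity of about 1.289 km/s, a yaw magnitude of about 9.3°, and an obliquity of about 64.2°, in agreement with the plate-frame values above.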


The scalability study was conducted with a constant workload (i.e., number of
computational cells on each processor for each of the simulations). This was done to keep the
computation-to-communication ratio constant for simulations involving different numbers
of processors. Maintaining a constant computation-to-communication ratio and
eliminating disk access for intermediate plot and restart files during the time integration permitted
the  computational  performance  to  be  isolated  and  measured  as  a  function  of  the  number  of 
processors  used. 

The single-processor baseline calculation used a Cartesian computational domain
spanning 21.5 cm in the X direction, 3.0 cm in the Y direction, and 6.0 cm in the Z direction.
The  computational  domain  was  discretized  into  uniform  cubic  zones  1  mm  long,  resulting  in 
a  3-D  grid  of  215  by  30  by  60.  As  the  number  of  processors  in  a  simulation  increased,  the 
number  of  zones  in  the  model  increased  accordingly  to  maintain  a  nearly  constant  number 
of  computational  zones  per  processor. 

All calculations were conducted for a simulated time of 40 μs. The grid was incrementally
refined by uniformly decreasing the characteristic zone length in each coordinate direction
by a factor of 2^(-1/3). This approach doubles the total number of grid points with each
successive mesh refinement. The characteristics of the grids used in the scalability study
are summarized in Table 1. In this table, the columns NI, NJ, and NK refer to the number
of Eulerian cells in the x, y, and z directions, respectively, and do not include ghost cells.
An alternative to this mesh refinement technique would be to double the number of zones
in one direction for one refinement, then double the number of zones in another direction
for the next refinement, and so on. This approach would reduce the time step by a factor
of two on the first refinement and would double the number of time integration cycles (i.e.,
computational cycles) to reach the desired simulation time of 40 μs. The method of uniform
zone size reduction resulted in a reduction of the time step by a factor of 2^(1/3) with each
refinement. As a result, the number of computational cycles required to reach 40 μs of
simulated time increased only by a factor of approximately 2^(1/3) each time the number of
processors doubled.
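
The difference between the two refinement strategies can be made concrete with a short calculation. The sketch assumes that the stable time step is proportional to the smallest zone dimension, so the cycle count is inversely proportional to it; the function names are hypothetical.

```python
def cycles_uniform(n_refinements):
    """Relative cycle count when every zone dimension shrinks by 2**(-1/3)
    per refinement (the zone count doubles each time)."""
    return 2.0 ** (n_refinements / 3.0)


def cycles_one_direction(n_refinements):
    """Relative cycle count when one direction at a time is doubled.

    The smallest zone dimension halves on refinements 1, 4, 7, ..., so the
    cycle count doubles on those refinements and is unchanged on the others.
    """
    halvings = -(-n_refinements // 3)  # ceiling of n_refinements / 3
    return float(2 ** halvings)


if __name__ == "__main__":
    for n in range(4):
        print(f"{2 ** n:2d} processors: uniform x{cycles_uniform(n):.2f}, "
              f"one direction x{cycles_one_direction(n):.2f}")
```

Both strategies reach the same cycle count after every third refinement, but the uniform reduction spreads the increase evenly, so each doubling of processors adds only about 26% more cycles.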


Table  1.  Computational  Grids  Used  in  Scalability  Study 


Number of                        Total Number   Average Zones   Zone Length
Processors     NI    NJ    NK        of Zones   per Processor          (mm)

         1    215    30    60         387,000         387,000          1.00
         2    271    38    75         772,350         386,175          0.80
         4    341    48    95       1,554,960         388,740          0.63
         8    430    60   120       3,096,000         387,000          0.50
        16    541    76   151       6,208,516         388,032          0.40
        32    683    95   191      12,393,035         387,282          0.31
        48    781   109   218      18,558,122         386,628          0.28
        64    860   120   240      24,768,000         387,000          0.25
        96    985   137   278      37,514,710         390,778          0.22
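
The internal consistency of Table 1 can be checked directly: the total number of zones should equal NI × NJ × NK, and the average per processor should equal that total divided by the number of processors. The short script below performs this check on the tabulated values; the names `TABLE_1` and `inconsistent_rows` are introduced here for illustration.

```python
# (processors, NI, NJ, NK, total zones, average zones per processor) from Table 1
TABLE_1 = [
    (1, 215, 30, 60, 387_000, 387_000),
    (2, 271, 38, 75, 772_350, 386_175),
    (4, 341, 48, 95, 1_554_960, 388_740),
    (8, 430, 60, 120, 3_096_000, 387_000),
    (16, 541, 76, 151, 6_208_516, 388_032),
    (32, 683, 95, 191, 12_393_035, 387_282),
    (48, 781, 109, 218, 18_558_122, 386_628),
    (64, 860, 120, 240, 24_768_000, 387_000),
    (96, 985, 137, 278, 37_514_710, 390_778),
]


def inconsistent_rows(rows):
    """Return the processor counts of rows whose totals or averages disagree."""
    bad = []
    for procs, ni, nj, nk, total, average in rows:
        if ni * nj * nk != total or round(total / procs) != average:
            bad.append(procs)
    return bad


if __name__ == "__main__":
    print("inconsistent rows:", inconsistent_rows(TABLE_1))
```

All nine rows pass, and the average workload stays within about 1% of the 387,000-zone baseline in every case.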

The  scalable  performance  of  the  message-passing  code  is  measured  by  the  grind  time, 
which  is  the  average  processor  time  required  for  the  code  to  revise  all  flow  field  variables 
for  one  computational  cell  in  a  given  time  increment  (cycle).  The  grind  time  is  expressed  in 
units of μs/(zone cycle). In a case of ideal scalability, the grind time will decrease by a factor
of  two  for  every  doubling  of  processors  used  if  the  ratio  of  computation  to  communication  is 
held  constant. 
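
The grind time and the ideal scaling line can be expressed in a few lines of code. This is a sketch under stated assumptions: the time entering the grind time is taken to be the elapsed time of the run, which is what must halve when the processor count doubles, and the numbers in the usage example are invented for illustration and are not measurements from this study. The function names are hypothetical.

```python
def grind_time(elapsed_seconds, n_zones, n_cycles):
    """Grind time in microseconds per zone per cycle."""
    return 1.0e6 * elapsed_seconds / (n_zones * n_cycles)


def ideal_grind_time(anchor_grind, anchor_processors, n_processors):
    """Grind time on the ideal scaling line through a chosen anchor point.

    Ideal scaling halves the grind time each time the processor count doubles.
    """
    return anchor_grind * anchor_processors / n_processors


def scaling_efficiency(measured_grind, anchor_grind, anchor_processors, n_processors):
    """Ratio of ideal to measured grind time (1.0 means ideal scaling)."""
    return ideal_grind_time(anchor_grind, anchor_processors, n_processors) / measured_grind


if __name__ == "__main__":
    # Hypothetical run: 387,000 zones, 100 cycles, 387 s elapsed on one processor.
    baseline = grind_time(387.0, 387_000, 100)
    print("baseline grind time:", baseline, "us/(zone cycle)")
    print("ideal on 8 processors:", ideal_grind_time(baseline, 1, 8))
```

Choosing a different anchor point, such as the eight-processor result, simply shifts the ideal line without changing its slope.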


5.  SCALABILITY  RESULTS 


Three  sets  of  calculations  were  performed  to  assess  the  scalable  performance  of  the  Sun 
HPC  10000  system.  The  first  set  of  calculations  was  run  on  one  of  the  64-processor  systems 
at  the  ARL  MSRC.  These  calculations  were  performed  with  the  processor  and  problem  size 
combinations  described  in  Table  1.  The  results  of  this  set  of  simulations  are  represented  by 
the square symbols in Figure 3. Also visible in the plot are two straight lines. The first
straight line is a line of ideal scalability that uses the single-processor result as its anchor
point. The second straight line is like the first, except that it uses the eight-processor result
as its anchor point. The results of this first set of simulations show that the two- and
four-processor calculations almost exactly match the ideal performance. The calculations using
eight or more processors fall slightly off the single-processor ideal scaling line. However,
comparing  these  results  to  the  eight-processor  ideal  scaling  line  shows  that  the  system  scales 
linearly  up  to  48  processors.  The  64-processor  calculation  falls  slightly  off  the  eight-processor 
ideal  scaling  line.  This  performance  penalty  is  caused  by  contention  between  the  operating 
system  and  the  application.  By  using  all  64  processors  in  the  system,  the  application  has  to 
compete  with  the  operating  system  for  system  resources,  which  results  in  a  slight  degradation 
in  performance. 


Figure  3.  Ideal  and  Measured  Scalability  of  CTH  on  the  Sun  HPC  10000 

The  shift  in  scalability  that  occurs  between  the  four-  and  eight-processor  results  can 
most  likely  be  attributed  to  the  system  processor  configuration.  The  64  processors  reside  on 
16  system  boards,  each  containing  four  processors  and  4  GB  of  memory.  While  there  is  no 
guarantee  that  these  two-  and  four-processor  calculations  ran  on  a  single  system  board,  one 
would  expect  such  “on  board”  calculations  to  be  the  most  efficient  because  they  do  not  need 
to use the crossbar interconnect for passing data between processors. If the two- and
four-processor calculations were actually run on a single system board, then the calculations using
eight  processors  or  more  would  all  have  used  the  crossbar  interconnect  for  message  traffic. 
The  linear  scalability  observed  from  8  to  48  processors  shows  that  the  crossbar  interconnect 
was  capable  of  handling  the  flow  of  data  between  processors  as  the  problem  size  and  number 
of  processors  used  increased  and  that  the  memory  access  was  truly  symmetrical. 

As  stated  earlier,  requirements  for  enhanced  simulation  fidelity  are  constantly  increasing. 
As  the  grid  resolution  is  increased  in  explicit,  finite  volume  simulations,  the  problem  size  and 
corresponding work load increase dramatically. To solve very large problems, it is necessary
to  run  simulations  using  numbers  of  processors  that  are  greater  than  what  is  currently 
available  in  a  single  symmetric  multi-processor  system.  To  address  this  need,  large  SMP 
systems  may  be  “clustered”  to  create  a  very  large  system  for  solving  highly  resolved  problems. 
The  second  and  third  sets  of  simulations  were  performed  to  determine  the  scalability  that 
could  be  obtained  by  clustering  two  Sun  HPC  10000  systems.  For  these  trials,  the  ATM 
interface  was  used  for  the  transfer  of  data  between  the  two  systems.  This  interface  has  a 
maximum  data  transfer  rate  of  622  Mb/s. 

The  ATM  interfaces  were  configured  to  communicate  using  data  packets  that  are  1,500  bytes 
in  size  for  the  second  set  of  trials.  This  relatively  small  packet  size  resulted  in  a  requirement 
to  transfer  a  large  number  of  packets  for  large  problems.  The  results  of  these  simulations 
are  represented  by  the  circle  markers  in  the  scalability  plot  of  Figure  3.  A  comparison  of 
these  markers  to  the  square  symbols  for  the  single-system  trials  shows  that  the  performance 
is  degraded  as  the  problem  size  (and  number  of  processors)  increases.  For  the  64-processor 
simulation  (which  uses  a  problem  size  of  approximately  24.7  million  Eulerian  zones),  the 
grind time increased by 47% over the same calculation run on a single system. The
performance limitation in these simulations is the transfer of data between the two SMP systems
through  the  ATM  interface. 

In an attempt to improve the communication performance between the two SMP
systems, the ATM interface was reconfigured to use an increased packet size of 9,218 bytes. The
third  and  final  set  of  simulations  was  run  using  this  larger  packet  size,  and  the  results  are 
represented by the diamond symbols in Figure 3. These results show a noticeable
improvement in the performance for large problem sizes. For the simulation using 64 processors,
the  grind  time  was  20%  greater  than  the  single-system  calculation.  While  the  large  packet 
results  do  not  match  those  of  the  single-system  performance,  they  provide  a  valuable  initial 
demonstration  of  the  ability  to  effectively  use  large,  clustered  SMP  systems  to  solve  problems 
using  large  numbers  of  processors. 
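
The effect of packet size on the number of packets can be estimated with a simple count. The sketch below ignores protocol headers and any per-packet overhead, and the 1,000,000-byte message is a hypothetical size chosen only to illustrate the ratio; the report does not give the actual message sizes. The function name `packets_needed` is likewise hypothetical.

```python
def packets_needed(message_bytes, packet_bytes):
    """Number of packets required to carry a message, ignoring header overhead."""
    if message_bytes < 0 or packet_bytes <= 0:
        raise ValueError("sizes must be non-negative and packet size positive")
    return -(-message_bytes // packet_bytes)  # integer ceiling division


if __name__ == "__main__":
    message = 1_000_000  # hypothetical ghost-cell message size in bytes
    for packet in (1_500, 9_218):
        print(f"{packet:5d}-byte packets: {packets_needed(message, packet)}")
```

For a message of this size, the larger packet size reduces the packet count by roughly a factor of six, which is consistent in direction with the reduced overhead observed in the third set of trials.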

The  computational  performance  of  CTH  on  any  computer  architecture  is  irrelevant  if  the 
results  are  incorrect.  To  verify  the  accuracy  of  the  results  computed  on  the  Sun  HPC  10000, 
the  single-processor  and  the  48-processor  simulations  were  extended  to  a  simulated  time 
of 100 μs and compared to experimental data. Data obtained from the experiment were
the  residual  length  and  velocity  of  the  rod  after  passing  through  the  target  plate  (Hertel 
1992).  The  initial  impact  conditions  are  provided  in  Figure  2.  The  residual  rod  length  and 
velocity  in  the  experiment  were  5.55  cm  and  1069  m/s,  respectively.  The  single-processor 
calculation,  which  employed  a  uniform  zone  size  of  1  mm,  produced  a  residual  rod  length  of 
5.93  cm  (6.8%  greater  than  the  experiment)  and  a  residual  velocity  of  1002  m/s  (6.3%  lower 
than  the  experiment).  The  48-processor  simulation  employed  a  uniform  zone  size  of  0.28  mm 
and  resulted  in  a  residual  rod  length  of  6.14  cm  (10.6%  greater  than  the  experiment)  and 
a  residual  rod  velocity  of  1020  m/s  (4.6%  lower  than  the  experiment).  Figure  4  provides 
an illustration of the finite plate perforation at 100 μs for both calculations. The upper
left  side  of  each  image  shows  the  target  plate  after  it  was  perforated  by  the  penetrator. 
The  simulation  was  set  up  with  the  rod  impacting  the  middle  of  the  plate.  As  a  result, 
the  calculation  used  a  plane  of  symmetry  about  the  point  of  impact.  In  the  figure,  the  rod 
Figure 4. Finite Plate Perforation at 100 μs: a. 1-mm Zones, b. 0.28-mm Zones
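
The percentage differences quoted for the residual rod length and velocity follow directly from the computed and experimental values given in the text. The helper name `percent_difference` is introduced here for illustration.

```python
def percent_difference(computed, measured):
    """Signed difference of a computed value from a measurement, in percent."""
    return 100.0 * (computed - measured) / measured


if __name__ == "__main__":
    measured_length_cm, measured_velocity_ms = 5.55, 1069.0
    cases = {
        "1 processor, 1-mm zones": (5.93, 1002.0),
        "48 processors, 0.28-mm zones": (6.14, 1020.0),
    }
    for label, (length_cm, velocity_ms) in cases.items():
        d_length = percent_difference(length_cm, measured_length_cm)
        d_velocity = percent_difference(velocity_ms, measured_velocity_ms)
        print(f"{label}: length {d_length:+.1f}%, velocity {d_velocity:+.1f}%")
```
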
6. SUMMARY


In a previous effort, the CTH hydrodynamics code was adapted to an SPMD
programming paradigm to exploit large, scalable computer architectures. This paradigm involves the
decomposition of the structured mesh into computational sub-domains, with explicit
message passing used to communicate data between the multiple processes used in solving the
problem.  This  method  has  been  demonstrated  to  scale  linearly  as  the  number  of  processors 
and  corresponding  problem  size  increased. 

A  new  system  architecture  in  the  HPC  arena  is  the  Sun  HPC  10000  system.  A  series  of 
numerical  simulations  was  performed  to  assess  the  scalability  of  CTH  on  this  system.  Three 
sets  of  trials  were  performed.  The  first  set  was  limited  to  a  single  system  containing  64 
processors.  The  study  demonstrated  that  the  application  scales  linearly  on  this  system  to  48 
processors.  However,  the  performance  of  the  application  on  64  processors  suffered  a  slight 
performance  degradation  as  a  result  of  contention  with  the  operating  system. 

Two  additional  sets  of  simulations  were  run  to  assess  the  scalability  of  large,  clustered 
SMP systems to solve large problems using large numbers of processors. Data
communication between SMP systems for these trials used packet sizes of 1,500 bytes and 9,218 bytes.
The  1,500-byte/packet  trials  suffered  a  noticeable  performance  degradation  as  the  problem 
size and resulting message traffic increased. Tests using the larger packet size showed a
significant improvement in performance for the large problems. While the performance in these
simulations  did  not  match  that  of  the  single-system  simulations,  these  trials  demonstrated 
the  ability  to  cluster  large  SMP  systems  to  solve  very  large  problems. 

The  results  described  in  this  report  do  not  provide  enough  information  to  determine 
whether  these  results  will  continue  to  scale  for  large  groups  of  SMP  systems.  Additional 
trials  with  at  least  four  SMP  systems  would  provide  a  better  indication  of  the  ability  to  use 
large SMP systems for this purpose. While the two sets of multiple-system trials show an
improvement  in  performance  with  increased  packet  size,  this  configuration  is  by  no  means 
optimal. Additional work is needed to determine the best type of dedicated network
configuration to use for building a cluster of this particular system for solving very large problems.